Revisiting comparable corpora in connected space
Abstract
Bilingual lexicon extraction from comparable corpora is generally addressed through two monolingual distributional spaces of context vectors connected through a (partial) bilingual lexicon. We sketch here an abstract view of the task in which these two spaces are embedded into one common bilingual space, and the two comparable corpora are merged into one bilingual corpus. We show how this paradigm accounts for a variety of models proposed so far, and where several previously addressed topics fit in this framework: degree of comparability, ambiguity in the bilingual lexicon, and the place of parallel corpora in this view, e.g., as a replacement for the bilingual lexicon. A first experiment, using comparable corpora built from parallel corpora, illustrates one way to put this framework into practice. We also outline how this paradigm suggests directions for future investigations. We finally discuss the current limitations of the model and directions for addressing them.
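As a rough illustration of the standard setup the abstract starts from (two monolingual context-vector spaces connected through a partial bilingual lexicon), the following minimal Python sketch may help. It is not the authors' model: the toy English and German sentences, the seed lexicon, the window size, the raw co-occurrence counts, and all function names are assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def project(vector, seed_lexicon):
    """Map a source context vector into the target space via the seed lexicon.

    Context words that the lexicon does not cover are dropped.
    """
    projected = Counter()
    for word, count in vector.items():
        for translation in seed_lexicon.get(word, ()):
            projected[translation] += count
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank_translations(word, src_vectors, tgt_vectors, seed_lexicon):
    """Rank target words by similarity to the projected source vector."""
    projected = project(src_vectors[word], seed_lexicon)
    scores = [(cosine(projected, vec), cand) for cand, vec in tgt_vectors.items()]
    return [cand for _, cand in sorted(scores, key=lambda p: (-p[0], p[1]))]


# Toy comparable corpora (hypothetical data, not from the paper).
SRC = [["doctor", "treats", "patient"],
       ["doctor", "works", "hospital"],
       ["cat", "eats", "mouse"]]
TGT = [["Arzt", "behandelt", "Patient"],
       ["Arzt", "arbeitet", "Krankenhaus"],
       ["Katze", "frisst", "Maus"]]

# Partial bilingual lexicon: "doctor" and "cat" are deliberately missing.
SEED = {"treats": ["behandelt"], "patient": ["Patient"],
        "works": ["arbeitet"], "hospital": ["Krankenhaus"],
        "eats": ["frisst"], "mouse": ["Maus"]}

SRC_VECTORS = context_vectors(SRC)
TGT_VECTORS = context_vectors(TGT)

print(rank_translations("doctor", SRC_VECTORS, TGT_VECTORS, SEED)[:3])
```

On this toy data the top-ranked candidate for "doctor" is "Arzt", because its projected context overlaps fully with the contexts of "Arzt" in the target corpus. Real systems typically replace raw counts with an association measure and handle lexicon ambiguity more carefully.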
Similar resources
Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Revisiting Context-based Projection Methods for Term-Translation Spotting in Comparable Corpora
Context-based projection methods for identifying the translation of terms in comparable corpora have attracted a lot of attention in the community, e.g. (Fung, 1998; Rapp, 1999). Surprisingly, none of those works have systematically investigated the impact of the many parameters controlling their approach. The present study aims at doing just this. As a test case, we address the task of translati...
Revisiting Missing Identity Ring of Iranian cities (A spatial Temporal Analysis of square Elements in Islamic Architecture and urbanism)
The square has long been considered a performance space in cities, and its design and structures have been a factor determining the identity of cities. However, with the growth of cities and the arrival of modernity, cities are facing management challenges. Accordingly, cities have gradually changed into a place for predicting different types of technological, conceptual...
Learning Comparable Corpora from Latent Semantic Analysis Simplified Document Space
Focusing on a systematic Latent Semantic Analysis (LSA) and Machine Learning (ML) approach, this research contributes to the development of a methodology for the automatic compilation of comparable collections of documents. Its originality lies within the delineation of relevant comparability characteristics of similar documents in line with an established definition of comparable corpora. Thes...
Combining Bilingual and Comparable Corpora for Low Resource Machine Translation
Statistical machine translation (SMT) performance suffers when models are trained on only small amounts of parallel data. The learned models typically have both low accuracy (incorrect translations and feature scores) and low coverage (high out-of-vocabulary rates). In this work, we use an additional data resource, comparable corpora, to improve both. Beginning with a small bitext and correspon...